
Sparse Transformer


O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers

Neural Information Processing Systems

Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse Transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How do the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. We prove sufficient conditions under which a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
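For intuition, here is a minimal NumPy sketch of how a sparse attention layer restricts each query to a bounded set of keys, giving O(n) connections in total. The sliding-window pattern, the window size w, and the tensor shapes are illustrative assumptions, not the specific sparsity patterns analyzed in the paper.

```python
# Sketch only: a sliding-window sparsity pattern with O(n) connections per layer.
# The window size w and the pattern itself are illustrative assumptions.
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean mask: query i may attend to key j only if |i - j| <= w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention restricted to the allowed connections."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)              # forbid masked pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d, w = 16, 8, 2
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = masked_attention(Q, K, V, sliding_window_mask(n, w))
print(sliding_window_mask(n, w).sum(), "connections vs", n * n, "for dense attention")
```

With window size w, each of the n queries attends to at most 2w + 1 keys, so the layer uses O(n·w) = O(n) connections for fixed w, versus $n^2$ for dense attention.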



We are encouraged that reviewers find our paper clear and well written (R1, R2, R3) and our method theoretically sound

Neural Information Processing Systems

We would like to thank the reviewers for their helpful comments and their thorough evaluation of our work. Reversible layers are a technique introduced by Gomez et al. (2017) and are orthogonal to our contribution; in contrast, clustered attention places no such restriction. We will also add Set Transformers to the related work section. Is speech favorable to clustering? We would like to mention that our NLP approximation experiment on the GLUE and SQuAD tasks in Section 4.3 addresses NLP/vision tasks in the long-context setting, as suggested.


A Supplementary Materials

Neural Information Processing Systems

A.1 Dataset Description

We describe additional details of each dataset below. We use the first 90% for the training set and the last 10% as the validation set, and we leave the last 210 days as the test set. We further evaluate the sensitivity to different hyper-parameters. The distribution of attention densities for different values of α is shown in Figure 2.
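As a rough illustration of the chronological splits described above, here is a hedged pandas sketch; the DataFrame, column names, daily frequency, and the way the 90/10 split composes with the 210-day test hold-out are assumptions for illustration only.

```python
# Sketch only: chronological splits as described above. The DataFrame, column
# names, daily frequency, and how the 90/10 split composes with the 210-day
# test hold-out are assumptions for illustration.
import pandas as pd

def chronological_splits(df: pd.DataFrame, train_frac: float = 0.9, test_days: int = 210):
    """Hold out the last `test_days` rows as the test set (one row per day assumed),
    then split the rest into the first 90% (train) and the last 10% (validation)."""
    trainval, test = df.iloc[:-test_days], df.iloc[-test_days:]
    cut = int(len(trainval) * train_frac)
    return trainval.iloc[:cut], trainval.iloc[cut:], test

# Example with a synthetic daily series.
df = pd.DataFrame({"date": pd.date_range("2018-01-01", periods=1000, freq="D"),
                   "value": range(1000)})
train, val, test = chronological_splits(df)
print(len(train), len(val), len(test))  # 711 79 210
```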




Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior

Han, Fuqun, Osher, Stanley, Li, Wuchen

arXiv.org Machine Learning

Modern generative models, such as neural ordinary differential equations (neural ODEs) [4], transformers [25], and diffusion models [22], have demonstrated remarkable ability to learn and generate samples from complex, high-dimensional probability distributions. These architectures have achieved broad success in scientific computing, image processing, and data science, offering scalable frameworks for data-driven modeling. However, training and sampling in such spaces remain expensive and highly sensitive to architectural and optimization choices. Despite these advances, the curse of dimensionality continues to present a fundamental challenge in many real-world applications. Fortunately, numerous problems in scientific computing exhibit intrinsic structures, such as sparsity, low-rank representations, or approximate invariances, that can be interpreted as prior information about the underlying data or operators. Leveraging such priors within generative models offers a promising avenue to improve both computational efficiency and generalization. A classical way to incorporate prior information, such as sparsity or piecewise regularity, is through Bayesian modeling, where the posterior combines a prior distribution encoding structural knowledge with a likelihood function derived from observations.
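As background for the $L_1$ prior mentioned in the title, the classical proximal operator of the $L_1$ norm is soft-thresholding, which sets small entries exactly to zero and is the standard way an $L_1$ sparsity prior is enforced. The sketch below shows only that classical operator; it is not the regularized Wasserstein proximal operator developed in the paper.

```python
# Sketch only: the classical proximal operator of the L1 norm (soft-thresholding),
# shown as background for how an L1 prior induces sparsity; this is not the
# regularized Wasserstein proximal operator developed in the paper.
import numpy as np

def prox_l1(v: np.ndarray, lam: float) -> np.ndarray:
    """prox_{lam * ||.||_1}(v) = argmin_x 0.5 * ||x - v||^2 + lam * ||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([-1.5, -0.2, 0.05, 0.8, 2.0])
print(prox_l1(v, lam=0.3))  # entries with |v_i| <= 0.3 are shrunk exactly to zero
```

Entries with magnitude below the threshold λ are mapped exactly to zero, which is how the $L_1$ prior promotes sparse solutions.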


